How to make ML model inferences on KFServing from containerized apps (web, Spark) running on Google Kubernetes Engine?
Let us say you have an ecommerce application and/or a big data application (such as Apache Spark) running on Kubernetes, an open-source container orchestration system for automating deployment, scaling, and management of containerized applications. Now you need to serve a pre-trained machine learning (ML) model on that existing Kubernetes infrastructure. This solution guide explores an opinionated approach to serving an ML model on this infrastructure using your established DevOps practices. The guide covers the following topics:
- Develop a web application and an Apache Spark application and test them in a local development environment.
- Provision a GKE environment to deploy and test the applications that perform ML model inference.
- Deploy a pre-trained TensorFlow (an end-to-end ML framework) model on KFServing (serverless model inferencing platform) hosted on Google Kubernetes Engine (GKE) — a fully-managed Kubernetes container orchestration service on Google Cloud.
Why Google Kubernetes Engine (GKE)?
GKE is a fully managed Kubernetes service that provides cluster autoscaling, auto-repair, workload & network security, integrated logging & monitoring, a built-in dashboard, GPU & TPU support, container isolation via GKE Sandbox, and more. Please check the documentation for the exhaustive list of the latest features.
Why KFServing?
KFServing enables serverless inferencing on Kubernetes with containers as its underlying infrastructure. It abstracts different ML frameworks such as TensorFlow, PyTorch, and XGBoost. It supports auto scaling, scale to zero, canary rollouts, GPUs, and more.
A step-by-step approach to develop and to deploy the solution
You could keep using your existing CI/CD infrastructure and DevOps practices to build the solution. Your web app and your Spark job can make low-latency calls, within the same Kubernetes cluster, to the ML model inference app (running in KFServing). You could choose different machine types for the underlying Kubernetes nodes for your web app, Spark app, and ML model inference app, based on your performance needs. As an example, you could use compute-optimized or memory-optimized machine types for the ML model inference application, an E2 machine type for the web app, and an N2 machine type for the Spark app.
The deployment architecture below is the visual representation of the solution:
The sequence diagram below shows the interaction between the web application and the ml model inference endpoint:
The sequence diagram below shows the interaction between the Spark application and the ml model inference endpoint:
The solution uses the following environments:
- Development environment — a Mac (the commands that you will use have been tested on a Mac). You could instead use Google Compute Engine (high-performance virtual machines) or Google Dataproc (a managed Hadoop and Spark cluster).
- Develop & unit test all the containerized applications locally. Perform integration test in GKE.
- Interact with Google Cloud services using the Google Cloud SDK (gcloud and gsutil command line tools).
- Google Cloud environment
The solution uses the following billable components of Google Cloud:
- Google Kubernetes Engine
- Google Cloud Storage
- Google Container Registry
- Google Cloud Run
- Google Cloud Logging
You will use Java for the web app and Scala for the Spark app. You will find all the code, the build file, the commands, the input file, the pre-trained model, the application war file, and the Spark application jar file in the git repository.
The solution entails the following steps:
- Identify a pre-trained Machine Learning model
- Inspect the SignatureDef of the model in the dev environment
- Create an input data set for testing the Spark application
- Deploy the model in TensorFlow Serving for the model inference in the dev environment
- Develop and test a Web application in the dev environment
- Create a container of the web app and test it in the dev environment. Push it to Google Container Registry
- Develop and test a Spark Application in the dev environment
- Build Spark images and push them to Google Container Registry
- Provision a Google Kubernetes Engine (GKE) cluster
- Deploy the Web application in GKE
- Deploy the KFServing in GKE and test the sample TensorFlow flowers model
- Deploy the rpm model in GKE and test model inference
- Test the Spark app and the Web app running on GKE
- Monitor the application logs of GKE in Google Cloud Logging
- Install all necessary software in the development environment
Please refer to the git repository commands.txt for the complete instructions including variables and the working directories that you will be using while building the solution.
Identify a pre-trained Machine Learning model
The first step is to gather a pre-trained ML model from your existing ML Ops pipeline. You could certainly use the same GKE infrastructure to provide a centralized development and operation platform for all of your Machine Learning model training, testing, and validation. ML and model training are usually resource intensive processes that make it difficult to run on local machines. Kubernetes leverages distributed computing to provide the required additional resources. You could use Kubeflow (the machine learning toolkit for Kubernetes) or you could selectively use only the pipeline framework, Kubeflow Pipelines, depending on your needs. For the current solution, you will use a pre-trained model that you created in “How to build an end-to-end propensity to purchase solution using BigQuery ML and Kubeflow Pipelines”.
Here is a short detour to the “How to build an end-to-end propensity to purchase solution using BigQuery ML and Kubeflow Pipelines” solution: propensity to purchase use cases are widely applicable across many industry verticals, such as Retail, Finance, and more. The solution shows you how to build an end-to-end (opinionated) solution using Google Cloud BigQuery ML (BQML), Google Cloud AI Platform Prediction, and Kubeflow Pipelines (KFP) on a Google Analytics dataset to determine which customers have the propensity to purchase. The opinionated solution incorporates some SDLC best practices, specifically while developing the pipeline components. You could use the solution to reach out to your targeted customers in an offline campaign via email or postal channels. You could also use it in an online campaign via an on-the-spot decision, when the customer is browsing your products on your website, to recommend products or trigger a personalized email for the customer.
If you haven’t built the model, then you could do so now or you could use the one that we provided in the git repository. It is a TensorFlow model. BQML exported the model to a Google Cloud Storage bucket. You will use the retail propensity model (rpm) for the remainder of this solution.
Inspect the signature of the model in the dev environment
After identifying the model, you need to understand its input and output parameters. You could either gather that from the model developer or inspect the model yourself. You will do the latter in the current solution. You will use the knowledge of the parameters in your applications (the web app and the Spark app), to create a proper input data set, and to compose the JSON data payload to test the rpm model inference.
You will need to do the following in this section:
- Install the saved_model_cli tool to inspect a TensorFlow rpm model
- Download the rpm artifacts
- Inspect the rpm saved model SignatureDefs
Install the tool to inspect a TensorFlow model
You could install the saved_model_cli tool in two ways: either by installing the tensorflow package or by building the tool from the source code. You will do the former in your development environment. If you want to use just the tool and not the whole tensorflow package, you will find the commands to build it with bazel in the git repo.
The commands below will install saved_model_cli tool:
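A minimal sketch of the install, assuming a pip-based setup (the complete, tested commands are in commands.txt in the git repository):
# install the tensorflow package, which bundles the saved_model_cli tool
pip install tensorflow
# confirm that the tool is available on the PATH
which saved_model_cli
saved_model_cli --help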
Below is the output of the last command:
The above output shows that you are using saved_model_cli version 0.1.0. You shouldn’t have an issue if you have a different version, but pay attention to the model export structure and format.
Download the rpm artifacts from Google Cloud Storage
You will now download the TensorFlow model that BQML exported to Cloud Storage. You have created the model in the previous solution “How to build an end-to-end propensity to purchase solution using BigQuery ML and Kubeflow Pipelines”. Alternatively, you could use the model that is available in the git repository.
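A sketch of the download, assuming the export location is captured in a hypothetical variable $RPM_MODEL_GCS_PATH (substitute your own bucket path):
# list what BQML exported, then copy the model artifacts locally
gsutil ls $RPM_MODEL_GCS_PATH
mkdir -p $HOME/rpm/$RPM_MODEL_VER
gsutil cp -r "$RPM_MODEL_GCS_PATH/*" $HOME/rpm/$RPM_MODEL_VER/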
Below is the output of the above gsutil command:
The mandatory files are saved_model.pb, variables.data-00000-of-00001, and variables.index. You could ignore eval_details.txt and train_detail.txt, which you created earlier in the previous solution with the intention of keeping different versions of the rpm model. You will remove them later.
Inspect the rpm saved model SignatureDefs
You will use the saved_model_cli tool to understand the input and the output parameters of the model (SignatureDef.)
The command below will print the SignatureDef of the rpm saved model:
# prints the inputs and the outputs of the rpm saved model
$PATH_TO_SAVED_MODEL_CLI/saved_model_cli show \
--dir $HOME/rpm/$RPM_MODEL_VER/ \
--all
Below is the output of the above command:
The rpm model expects ‘bounces’ and ‘time_on_site’ as inputs. The model then returns three output parameters viz. ‘predicted_will_buy_on_return_visit’, ‘will_buy_on_return_visit_probs’, and ‘will_buy_on_return_visit_value’.
Create an input data set for testing the applications
You need to create an input data set, based on the SignatureDef of the rpm model, which you will use to test the Spark application. You will find the sample data set in the git repository. You will use the dataset later when you are going to test the Spark application in the development and in the GKE environment.
You will create a file that contains the following information:
- fullVisitorId
You will ignore the fullVisitorId column. You could use it as a key to integrate the propensity to purchase predictions with Customer Relationship Management (CRM) system data like email addresses to make customer outreach easier. For an example of this type of integration, see Salesforce Marketing Cloud audience integration. This article describes how to integrate Google Analytics 360 with Salesforce Marketing Cloud so you could use Analytics 360 audiences in Salesforce email and SMS direct-marketing campaigns.
- bounces
Our rpm model needs this as an input parameter.
- time_on_site
Our rpm model needs this as an input parameter.
- will_buy_on_return_visit
You could use this to compare how the model inference is performing. You are using a test dataset; in production, you won’t know this value beforehand, as the ML model is going to predict it for you. Model performance optimization is out of scope for the current solution guide.
The command below prints a sample of the input data set:
# Refer to the commands.txt for detailed commands in the “4. input file to test the Spark application” section
# peek the input csv file
head -n 3 $HOME/sparkjob/rpm-hdfs-input.csv
Below is the output of the above command:
Deploy the model in TensorFlow Serving for the model inference in the dev environment
You will deploy the rpm model in TensorFlow Serving (TFServing) to get an endpoint. You will use the endpoint to check how the request/response works and to unit test the applications (the web app and the Spark app) in the development environment.
“TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out of the box integration with TensorFlow models, but can be easily extended to serve other types of models.”
The easiest way to get started with TensorFlow Serving is with Docker.
You will need to do the following in this section:
- Install/validate Docker
- Download the TFServing container
- Run TFServing for the rpm model
- Test the rpm inference endpoint
You will need Docker to test the rpm model inference endpoint and the web app containers that you are going to build later.
The following commands will help you validate and/or install Docker:
Below is the output from the docker and docker-image version commands:
Download the TFServing container:
# pull the image and ensure that the image is now available for you to use
docker pull tensorflow/serving
docker image ls tensorflow/serving
Run TFServing for the rpm model
TFServing expects saved models to follow a certain directory structure semantic. You will restructure the rpm model that you downloaded earlier and remove the unnecessary files. Then you will run TFServing for the rpm model. Your server will listen on port 8000. You could expose any port based on your environment needs; in case you decide to do so, please adjust the commands to refer to the new port.
Below are the commands that do what we just described:
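A sketch of those commands, assuming the restructured model sits under $HOME/rpm/$RPM_MODEL_VER and mapping TFServing’s REST port 8501 to port 8000 on the host:
# remove the files that TFServing does not need
rm $HOME/rpm/$RPM_MODEL_VER/eval_details.txt $HOME/rpm/$RPM_MODEL_VER/train_detail.txt
# run TFServing with the versioned model directory mounted as the model 'rpm'
docker run -d --name rpm-tfserving -p 8000:8501 \
  -v "$HOME/rpm:/models/rpm" \
  -e MODEL_NAME=rpm \
  tensorflow/serving
# confirm that the container is up and listening on port 8000
docker ps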
Below is the output of the docker process command:
You will now compose a POST request, with the inputs to the model, viz. ‘bounces’ and ‘time_on_site’, to the TFServing container serving the rpm model:
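A sketch of the request using TFServing’s REST predict API (the input values below are arbitrary sample values):
# invoke the rpm model inference endpoint exposed by the TFServing container
curl -s -X POST http://localhost:8000/v1/models/rpm:predict \
  -d '{"instances": [{"bounces": 0, "time_on_site": 7363}]}'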
Below is the output from the POST request:
The above output shows that the customer is going to buy the next time they visit the website; you will see this in the ‘will_buy_on_return_visit_values’ field. The prediction is based on a threshold value of 0.5, and the probability is 0.775 (77.5%), which is in the ‘will_buy_on_return_visit_probs’ field.
Develop and test a Web Application in the dev environment
You will develop a web application, for demonstration purposes, to test the ML model inference endpoint that is already running as a container and listening on port 8000. Please refer to the UML sequence diagram for the interaction of the web app with the ML model inference endpoint. You will use Java for the backend and HTML/CSS for the frontend. You will export the web app as an ecommerce.war file. The ecommerce application will be available on the /ecommerce/ web path.
Alternatively, you could skip the development of the web app and use the ecommerce.war file that is available in the git repo.
You will need to do the following in this section:
- Install, configure and test Apache Tomcat
- Develop and test a web application
- Export the war file (name it ecommerce.war)
Install, configure, and test Tomcat
If you don’t have Apache Tomcat, install it on your development machine. You will run Tomcat on port 8080. You could adjust the port as per your requirements.
Run the below commands to install and test tomcat:
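A sketch of the Tomcat setup on a Mac, assuming a Homebrew install (a tarball install works just as well; adjust the paths accordingly):
# install and start Apache Tomcat (it listens on port 8080 by default)
brew install tomcat
catalina run &
# verify that Tomcat serves the default page
curl -I http://localhost:8080/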
Below is the output of the above commands:
Develop and test a web application
You will now use the HTML and the Java file from the git repo to develop a web app, using your favorite IDE or development tool to build the war file (ecommerce.war). You could convert your project to a Maven project. We will not give instructions on how to use an IDE, Maven, etc. to develop and build the war file; this is beyond the scope of the current solution.
You will use the following fields of the form:
- ML inference host url: This is the ml inference host end point. For the development environment it will be the TFServing endpoint.
- Invoke ML inference end point: If ‘no’, then the backend Java code skips the ML inference invocation. This provides a mechanism to troubleshoot.
- Bounces (input feature): This is an input to the model. You discovered this input parameter earlier by inspecting the SignatureDef.
- Time on site (input feature): This is an input to the model. You discovered this input parameter earlier by inspecting the SignatureDef.
- Submit button: Posts the form to the server.
Below is a snippet of the HTML index.html page: (complete code is available here.)
Below is the output from the index.html page:
Below is a snippet of the Java code: (complete code is available here.)
Below is the output from the backend Java code:
As you can see, the backend Java app made an inference call to the rpm model and printed its output. Please refer to the sequence diagram for details.
You can also skip the ML inference POST part for troubleshooting purposes by providing ‘no’ (without the quotes) in the HTML form field “Invoke ML inference end point”. Below is the screenshot of the response for the HTTP POST:
As you can see, the backend Java app skipped the rpm model inference. This comes in handy when troubleshooting, when you want to test only the Java Servlet.
Export the war file
After you have finished developing and unit testing the app, export the war file so that you can go on to create a container of the app. Alternatively, you could use the war file available in the git repository.
Refer to the commands and instructions below:
Create a container of the web app and test it in the dev environment. Push it to Google Container Registry
Now that you have a .war file, you will create a docker image, run a container with the image, test the code in the container, and push the image to the Container Registry.
Use the commands below to create a container with the webapp war and run the image:
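A sketch of the image build and run, assuming a minimal Dockerfile that layers the war file onto the official Tomcat image (the Dockerfile in the git repository may differ; the image name webapp:v1 is an example):
# a minimal Dockerfile: copy the war into Tomcat's webapps directory
cat > Dockerfile <<'EOF'
FROM tomcat:9.0
COPY ecommerce.war /usr/local/tomcat/webapps/
EOF
# build the image and run the container, exposing Tomcat on port 8080
docker build -t webapp:v1 .
docker run -d --name ecommerce-webapp -p 8080:8080 webapp:v1
docker ps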
Below is the output of the above commands:
You can see that ‘catalina.sh’ (tomcat server) is running and is listening on port 8080.
Now that you have your container up and running, you will test the application. You could test either using a browser or using a command line tool such as curl. We illustrate both ways.
To test the application using a browser:
- Launch a browser and go to http://{MY_DOCKER_HOST}:8080/ecommerce/
- Enter ‘yes’ in the ‘Invoke ML inference end point’ field.
- Fill in the “ML inference host url:” field with the ml inference endpoint.
You will find the screenshots of the web application in the “Develop and test a web application” section.
To test the application using curl, use the following commands:
Below are the output of the curl commands:
The above output shows that the Java Servlet invoked the rpm model inference.
The above output shows that Java Servlet skipped the rpm model inference.
You will now push the image to the Google Container Registry.
Run the commands below to push the webapp image to the Container registry.
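A sketch of the tag and push, assuming the webapp:v1 image from the previous step and the project id in $PROJECT_ID:
# authenticate Docker with Container Registry, then tag and push the image
gcloud auth configure-docker
docker tag webapp:v1 gcr.io/$PROJECT_ID/webapp:v1
docker push gcr.io/$PROJECT_ID/webapp:v1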
Below is the output of the tag and the push commands:
Your output for the docker push varies if you are executing the command for the first time.
Develop and test a Spark Application in the dev environment
You will develop a Spark application, for demonstration purposes, to test the rpm model inference endpoint that is already running as a container and listening on port 8000. Please refer to the UML sequence diagram for the interaction of the Scala app with the ML inference endpoint. You will use Scala for the application and export the application as a jar file. The tool that you are going to use will name the jar file mlinference_2.12-1.0.jar.
You could skip the development of the scala app and use the jar file that is available in the git repo.
You will need to do the following in this section:
- Install and sanity check Scala
- Install and sanity check sbt
- Install and sanity check Spark
- Develop and test the Spark App
The commands below install Scala and sanity check the install:
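A sketch of the Scala install and sanity check, assuming Homebrew on a Mac:
# install Scala, then compile and run a trivial program
brew install scala
cat > HelloWorld.scala <<'EOF'
object HelloWorld extends App { println("Hello, World!") }
EOF
scalac HelloWorld.scala
scala HelloWorld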
Below is the output of the Scala compile and run commands:
The commands below install sbt and sanity check the install:
Below is the output of the sbt compile and run commands:
You will now prepare a Spark working directory, download additional jars, create a proper configuration file, copy the Google Cloud service account json file, and test spark-shell. You will have Spark use Google Cloud Storage, instead of local HDFS, for both the application jar file and the input file (the csv file that you created earlier).
Run the commands below to prepare Spark for the next steps:
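A sketch of the preparation, assuming Spark 3.0.1 with Hadoop 2.7 (as used elsewhere in this guide), the publicly hosted Cloud Storage connector, and a hypothetical service account key path:
# download and unpack Spark
mkdir -p $HOME/Downloads/spark_dir && cd $HOME/Downloads/spark_dir
curl -LO https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
tar -xzf spark-3.0.1-bin-hadoop2.7.tgz
# add the Cloud Storage connector so Spark can read gs:// paths
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar \
  spark-3.0.1-bin-hadoop2.7/jars/
# point the connector at the service account key (path is an example)
cat >> spark-3.0.1-bin-hadoop2.7/conf/spark-defaults.conf <<'EOF'
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile /path/to/sa-key.json
EOF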
Run the below command to launch spark-shell for a quick check for the Spark install:
Below is a screenshot of the commands with their outputs for the quick sanity check of the Spark install:
Run the below command to launch spark-shell to test a REPL Scala program for the Spark install:
Below is a screenshot of the commands with their outputs for the REPL Scala program for the Spark install:
Package the HelloWorld program to a jar file using the commands below:
# package the code to a jar and spark-submit to the local master
cd $HOME/sbttest_sanity
sbt package
Run the below command to run the Hello World program through a spark-submit:
cd $HOME/Downloads/spark_dir/spark-3.0.1-bin-hadoop2.7/bin
./spark-submit \
--class com.mycos.test.HelloWorld \
--master local[1] \
$HOME/sbttest_sanity/target/scala-2.12/sbttest_sanity_2.12-0.1.0-SNAPSHOT.jar
Below is the screenshot of the spark-submit command with its output, testing Spark by submitting a HelloWorld Spark job:
You will now develop a Scala program which reads the input file containing the sample data that you created earlier and invokes the ML inference endpoint for each of the input records. You could optimize the program to partition the data in Spark and submit the partitioned data for prediction in batch mode rather than online mode; however, that is out of the scope of the current solution.
The scala program expects the following arguments:
- ML inference host url: This is the ml inference host end point. For the development environment it will be the TFServing endpoint.
- Invoke ML inference end point: If ‘no’, then the Spark job skips the ML inference invocation. This provides a mechanism to troubleshoot.
- Input file path: This is the full path to the sample data in Google Cloud Storage
In addition to the above arguments the spark-shell expects certain variables. We will explain them later.
You will do the following:
- Develop, compile, and package the spark application jar file
- Upload the jar file to a Google Cloud Storage bucket
- Upload the rpm-hdfs-input.csv file to the Storage bucket
- Test spark-submit with the proper arguments
You could skip developing the jar file and use the one available in the git repo.
Follow the instructions below to develop, compile, and package the Scala application:
Below is a snippet of the Scala program: (complete code is available here.)
Below is a snippet of the sbt file: (complete code is available here.)
Follow the instructions below to upload the input file with sample data and the jar file to Google Cloud Storage:
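A sketch of the uploads, assuming a hypothetical staging bucket in $BUCKET and the jar path produced by sbt package:
# upload the sample input data and the application jar to Cloud Storage
gsutil cp $HOME/sparkjob/rpm-hdfs-input.csv gs://$BUCKET/sparkjob/
gsutil cp target/scala-2.12/mlinference_2.12-1.0.jar gs://$BUCKET/sparkjob/
gsutil ls gs://$BUCKET/sparkjob/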
Follow the instructions below to test the spark-job in the development environment:
The Scala job reads the input csv from Google Cloud Storage, iterates through each row, prints it, and (optionally, depending on the ‘yes’ or ‘no’ argument) invokes the ML model inference endpoint.
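A sketch of the spark-submit, assuming a hypothetical main class name com.mycos.test.MLInference (check the Scala source in the repo) and the bucket variable from the previous step; the first argument is the TFServing endpoint in the dev environment:
cd $HOME/Downloads/spark_dir/spark-3.0.1-bin-hadoop2.7/bin
./spark-submit \
  --class com.mycos.test.MLInference \
  --master local[1] \
  gs://$BUCKET/sparkjob/mlinference_2.12-1.0.jar \
  http://localhost:8000 yes gs://$BUCKET/sparkjob/rpm-hdfs-input.csv
# arguments: <ML inference host url> <yes|no to invoke inference> <input file path>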
Below is the output of the spark-submit job:
The above output shows the following input rpm data:
- The entire dataset in rpm-hdfs-input.csv
- Each row of the above dataset
- “Skipped the ML inference invocation” message (because you passed ‘no’ in the argument)
The above output shows the following input rpm data:
- The entire dataset in rpm-hdfs-input.csv
- Each row of the above dataset
- Predictions from the rpm model inference (because you passed ‘yes’ in the argument)
Build Spark images and push them to Google Container Registry
In this section you will create two Spark images: a Spark base image and a Spark image that extends the base image with the Google Cloud Storage connector and additional jar files. You could have created a single image with the additional jars, but that might have created a challenge in troubleshooting. Thus you will keep two images for better unit testing.
The commands below create the two images and push them to Google Container Registry:
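A sketch of the builds, using Spark’s bundled docker-image-tool.sh for the base image and a small derived image that adds the Cloud Storage connector (the exact Dockerfile is in the git repository):
cd $HOME/Downloads/spark_dir/spark-3.0.1-bin-hadoop2.7
# build the Spark base image and push it to Container Registry
./bin/docker-image-tool.sh -r gcr.io/$PROJECT_ID -t v3.0.1 build
./bin/docker-image-tool.sh -r gcr.io/$PROJECT_ID -t v3.0.1 push
# extend the base image with the GCS connector jar and push it as spark_gcs
cat > Dockerfile.gcs <<EOF
FROM gcr.io/$PROJECT_ID/spark:v3.0.1
ADD https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar /opt/spark/jars/
EOF
docker build -t gcr.io/$PROJECT_ID/spark_gcs:v3.0.1 -f Dockerfile.gcs .
docker push gcr.io/$PROJECT_ID/spark_gcs:v3.0.1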
Provision a Google Kubernetes Engine (GKE) cluster
You have developed and unit tested the web application and the Spark application. Now you are going to deploy them in GKE to perform integration testing.
You will do the following:
- Provision GKE cluster and the workload specific node pools
- Grant proper permissions
- Deploy the web application
- Deploy and test the Spark job using the Spark images from the registry
- Deploy the KFServing and test the sample TensorFlow flowers model
- Deploy the rpm model
Provision the GKE cluster and the node pools
You will now create a GKE cluster and multiple node pools for different Kubernetes workload types. You could change the machine types for each node pool in the commands to suit your needs. You will deploy the web app in the “webapp-pool” node pool and the rpm model in the “kfserving-pool” node pool. The Spark jobs are deployed to the default node pool.
Please find the full command in the git repository. You need Knative (a Kubernetes-based platform to deploy and manage modern serverless workloads) for KFServing. Since Cloud Run is a fully managed serverless platform provided out of the box by Google, you will be using that as the Knative component. In order to test the readiness of the test model deployed to KFServing, you require an external endpoint, which is the istio-ingress provided to you as part of the Cloud Run install.
Below are the commands to do so:
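A sketch of the provisioning, with example zone, node counts, and machine types (the full, tested commands, including any extra flags required for the Cloud Run add-on, are in the git repository):
# create the cluster with Workload Identity and the Cloud Run (Knative) add-on enabled
gcloud container clusters create $CLUSTER_NAME \
  --zone $ZONE \
  --machine-type n2-standard-4 \
  --num-nodes 3 \
  --workload-pool=$PROJECT_ID.svc.id.goog \
  --addons HttpLoadBalancing,CloudRun
# dedicated node pools for the web app and for KFServing
gcloud container node-pools create webapp-pool \
  --cluster $CLUSTER_NAME --zone $ZONE --machine-type e2-standard-4 --num-nodes 2
gcloud container node-pools create kfserving-pool \
  --cluster $CLUSTER_NAME --zone $ZONE --machine-type c2-standard-8 --num-nodes 2
# fetch credentials so kubectl talks to the new cluster
gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE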
Grant proper permission
You will configure workload identity. It is the recommended way to access Google Cloud services from applications running within GKE due to its improved security properties and manageability. The Spark job needs to access the Cloud Storage bucket that hosts the input file and the application jar file.
Run the following commands to set up the workload identity:
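A sketch of the wiring, assuming a Kubernetes service account named spark in $SPARK_NAMESPACE (created in the namespace section below) and a hypothetical Google service account name in $GSA_NAME:
# allow the Kubernetes service account to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding \
  $GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$PROJECT_ID.svc.id.goog[$SPARK_NAMESPACE/spark]"
# annotate the Kubernetes service account with the Google service account
kubectl annotate serviceaccount spark -n $SPARK_NAMESPACE \
  iam.gke.io/gcp-service-account=$GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com
# grant the Google service account read access to the bucket with the jar and input file
gsutil iam ch \
  serviceAccount:$GSA_NAME@$PROJECT_ID.iam.gserviceaccount.com:objectViewer gs://$BUCKET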
Pin and taint the nodes
You will now pin and taint the webapp and kfserving nodes so that the applications run on the node pools with the machine types that you chose when you provisioned GKE previously.
You will run the commands below to pin and taint the nodes:
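A sketch of the taints, using the node-pool label that GKE applies automatically (the taint keys and values are examples; the matching nodeSelector and tolerations belong in the deployment YAMLs):
# taint the webapp and kfserving nodes so that only matching workloads schedule there
kubectl taint nodes -l cloud.google.com/gke-nodepool=webapp-pool \
  dedicated=webapp:NoSchedule
kubectl taint nodes -l cloud.google.com/gke-nodepool=kfserving-pool \
  dedicated=kfserving:NoSchedule
# verify the taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints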
Create a namespace and test the Spark images
You will create a namespace and grant proper permission before proceeding to the next steps. You will also set up the workload identity.
Below are the commands to do so:
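A sketch of the namespace and RBAC setup, reusing the hypothetical spark service account name from the workload identity step:
# create the namespace and a service account for the Spark driver
kubectl create namespace $SPARK_NAMESPACE
kubectl create serviceaccount spark -n $SPARK_NAMESPACE
# grant cluster-admin for development purposes only; narrow this for production
kubectl create clusterrolebinding spark-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=$SPARK_NAMESPACE:spark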
You have used the ‘cluster-admin’ role for development purposes; however, for production, we suggest you follow the principle of least privilege.
Now that you have created the namespace, granted proper access, and pushed the Spark images to Google Container Registry, you will unit test them using the out of the box “Pi” example.
The commands below test both the images with the out of the box “Pi” example:
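A sketch of the SparkPi submission, assuming the Kubernetes API server address is in $K8S_MASTER; run it once with the spark:v3.0.1 image and once with spark_gcs:v3.0.1:
cd $HOME/Downloads/spark_dir/spark-3.0.1-bin-hadoop2.7
./bin/spark-submit \
  --master k8s://https://$K8S_MASTER \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=gcr.io/$PROJECT_ID/spark:v3.0.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar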
Below is the output of the spark image create command:
Below is the output of the container log:
The above output shows that ‘Pi is roughly 3.139155695778479.’ You will get similar output for both spark-submit runs, one with the ‘spark:v3.0.1’ image and the other with the ‘spark_gcs:v3.0.1’ image.
Deploy the web app in GKE
You will create a namespace, deploy the web application in the cluster, and expose port 8080. You will use port 8080 to access the web app from a web browser in your development environment.
You will find the webapp.yaml file in the git repository. You will need it for the deployment.
Below are the commands that you will use to deploy the web application:
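A sketch of the deployment, assuming the deployment defined in webapp.yaml is named webapp (check the yaml in the repo) and the namespace name is in $WEBAPP_NAMESPACE:
# create the namespace and deploy the web application
kubectl create namespace $WEBAPP_NAMESPACE
kubectl apply -f webapp.yaml -n $WEBAPP_NAMESPACE
# expose the deployment on port 8080 through an external load balancer
kubectl expose deployment webapp -n $WEBAPP_NAMESPACE \
  --type=LoadBalancer --port=8080 --target-port=8080
kubectl get svc -n $WEBAPP_NAMESPACE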
Deploy the KFServing in GKE and test the sample TensorFlow flowers model
Now you will deploy cert-manager, which is a prerequisite as explained in the KFServing git repository.
The command below deploys cert-manager:
cp $GIT_CLONE_HOME_DIR/gke_deploy_dir/cert-manager.yaml .
# deploy cert-manager
kubectl apply --validate=false -f $HOME/gke_deployment/cert-manager.yaml
Below is the output of the above command:
You will deploy the v1beta1 version of KFServing. You will use the modified version of the kfserving.yaml file, available in the git repo, to deploy it to your cluster. Please refer to the yaml file in the repo for the precise changes made to it. You will create a namespace for the KFServing deployment and then deploy the yaml file. You will configure workload identity. You will also deploy the TensorFlow flowers example that is available in the KFServing git repository. After this deployment, you will be able to sanity check it.
Run the commands and follow the instructions below to deploy KFServing and the flowers example:
Below is the output of the kubectl apply command:
After you deploy the KFServing in the GKE cluster, you can sanity check the deployment by running the following commands to make an inference from the TensorFlow flowers example that is available in the KFServing git repository:
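A sketch of the two checks, assuming the istio-ingress address and the model’s host name have been captured in $INGRESS_HOST, $INGRESS_PORT, and $SERVICE_HOSTNAME, and the sample flowers payload from the KFServing repo is saved locally as flowers-input.json:
# 1) model status: the response should report 'state' as 'AVAILABLE'
curl -H "Host: $SERVICE_HOSTNAME" \
  http://$INGRESS_HOST:$INGRESS_PORT/v1/models/flowers-sample
# 2) prediction request with the sample payload
curl -H "Host: $SERVICE_HOSTNAME" \
  -d @./flowers-input.json \
  http://$INGRESS_HOST:$INGRESS_PORT/v1/models/flowers-sample:predict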
Below is the screenshot of the two curl commands along with their outputs:
The output of the first curl shows ‘state’ as ‘AVAILABLE’, which means that the model is ready to serve requests. The output of the second command shows the ‘predictions’ returned by KFServing.
Deploy the rpm model in GKE and test model inference
You will deploy the rpm model that you downloaded earlier to the cluster. You will use the .yaml file that is available in the git repository. You will also test whether the rpm model is ready to accept traffic. The commands below do that:
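A sketch of the rpm deployment and readiness check, assuming the InferenceService manifest is named rpm.yaml and the KFServing namespace is in a hypothetical variable $KFSERVING_NAMESPACE:
# deploy the rpm InferenceService and check that it becomes ready
kubectl apply -f rpm.yaml -n $KFSERVING_NAMESPACE
kubectl get inferenceservice -n $KFSERVING_NAMESPACE
# model status through the ingress; 'state' should be 'AVAILABLE'
curl -H "Host: $MLINFER_ENDPOINT_EXTERNAL" \
  http://$INGRESS_HOST:$INGRESS_PORT/v1/models/rpm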
Below is the output of the above curl command:
Test the Spark app and the Web app in GKE
You will test the web app and also submit a spark job to GKE. You will test the end to end integration from both the web app and the Spark app. It involves invoking the ML model inference from both the apps.
You will start by testing the Scala application by submitting it to Spark in GKE. To test, you will do the following:
- Prepare a directory with the Spark binaries including spark-submit
- Setup certificate, token, and secret to connect to the GKE cluster
- Finally, spark-submit the Spark job
The commands below perform the tasks mentioned above:
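A sketch of those tasks, reusing the hypothetical variables and class name from the earlier Spark steps; here the cluster credentials from kubectl’s current context stand in for the certificate/token setup described above:
cd $HOME/Downloads/spark_dir/spark-3.0.1-bin-hadoop2.7
# point spark-submit at the GKE API server
export K8S_MASTER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
./bin/spark-submit \
  --master k8s://$K8S_MASTER \
  --deploy-mode cluster \
  --name rpm-mlinference \
  --class com.mycos.test.MLInference \
  --conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=gcr.io/$PROJECT_ID/spark_gcs:v3.0.1 \
  gs://$BUCKET/sparkjob/mlinference_2.12-1.0.jar \
  $MLINFER_ENDPOINT_INTERNAL yes gs://$BUCKET/sparkjob/rpm-hdfs-input.csv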
You will find the screenshots of the output of the Spark application in the “Develop and test a Spark Application” section.
To test the web application using a browser:
- Launch a browser and go to http://<EXTERNAL_IP>:8080/ecommerce/
- Enter ‘yes’ in the ‘Invoke ML inference end point’ field.
- Fill in the “ML inference host url:” field with the ML inference endpoint. You have already collected this endpoint in the variable $MLINFER_ENDPOINT_INTERNAL.
Below is the command that gives you the external ip:
# gather the external ip
export EXTERNAL_IP=`kubectl get svc -n $WEBAPP_NAMESPACE -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'`
echo $EXTERNAL_IP
You will find the screenshots of the web application in the “Develop and test a web application” section.
Monitor the application logs of GKE in Google Cloud Logging
You will monitor all of your application logs in Google Cloud Logging. You need to locate the container that executed your job.
You will find the logs of the last-run driver container by issuing the following command:
kubectl logs -f `kubectl get pods -n $SPARK_NAMESPACE \
--sort-by=.metadata.creationTimestamp | grep driver \
| tail -1 | awk -F ' ' '{print $1}'` -n $SPARK_NAMESPACE
Go to Google Cloud Console → Kubernetes Engine → workload → <spot the container that ran the job> → Logs → Container Logs
Next Steps
- Pay attention to the security:
— You have used HTTP endpoints for both the web app and the Spark app. Set up and use SSL/TLS.
— Check “Hardening your cluster’s security”.
— Check the “Security overview”.
- Invoke the ML inference endpoint on the Web client side:
Register the host url of the service provided by Knative in your DNS with the public ingress IP. The host value is in the variable MLINFER_ENDPOINT_EXTERNAL. The public IP is in the variable INGRESS_HOST.
- Use data partition and batch prediction:
You have iterated through the input data and for each input data row you have invoked the ml inference endpoint. You have used online prediction for the ml inference. You could partition the data for the Spark job. You could then do batch prediction for better performance.
- Write the prediction results to a permanent storage:
You have printed the prediction results on the console in the spark app. You could write the results in a permanent storage.
- Deploy different versions of the model and do A/B testing to find out the new model’s efficacy:
You have deployed just one version of the rpm model. You could deploy multiple versions of the rpm model and test them in real production by conducting A/B testing to find out which model version works better.
- Use Spark Operator:
You have used a scaled-down version of Spark for demonstration purposes. You might already be running the Spark Operator for a full-fledged Spark cluster. If not, then you could consider the Spark Operator.
- Use skaffold:
You have used docker in the local environment to test out the web application, the spark application, and the rpm model serving. You could test your applications in GKE directly using Skaffold. “Skaffold handles the workflow for building, pushing and deploying your application, allowing you to focus on what matters most: writing code.”
- Fold the provisioning and CI/CD portion of the article into your existing DevOps infrastructure. Use declarative YAML instead of imperative kubectl commands:
You have executed shell and gcloud commands to provision the GKE infrastructure and to build Docker images. You have used kubectl operations to modify the cluster, e.g. ‘kubectl expose deployment’. You could create the deployment and the service YAML and then use kustomize to deploy them easily with one command. We suggest that you automate these steps, write them as infrastructure as code, and incorporate them into your existing pipeline infrastructure.
- Use gRPC instead of HTTP endpoint:
You have used HTTP in the solution; however, for better performance we suggest you use gRPC. Please check the KFServing roadmap for support of this feature.
Summary
Congratulations!!! You have made it to the end of the solution. You have learnt how to make ML model inferences against KFServing from a web app and a Spark app, where both the apps and KFServing are running in the GKE environment. We hope that you enjoyed the comprehensive solution and that you can use the knowledge and the code snippets in your projects.
Want more?
Please leave me your comments with any suggestions or corrections.
About me: I work in Google Cloud. I help our customers build solutions on Google Cloud. Here is my LinkedIn profile.
Gratitudes
Special thanks to Praveen Rajagopalan for co-authoring the GKE and KFServing part of the solution along with the respective deployment and testing commands. Thanks to my colleagues Rajesh Thallam and Ameer Abbas who reviewed the solution.